Abstract - In most cases wireless sensor nodes are deployed 
at remote locations, where failure detection & recovery are 
crucial tasks. It is very challenging to observe the effect of 
a failed node on the network. Recovery of a faulty node, as 
well as smooth functioning of the network until that node is 
recovered, are interesting topics for research. Plugging out a 
faulty node and plugging in a recovered node affect the 
configuration and working of the network, and are demanding 
areas in fault management. 

The aim of this paper is to study state-of-the-art research 
solutions for detecting and recovering faulty nodes in wireless 
sensor networks. In addition, we identify the strengths & 
weaknesses of these solutions. Further, we provide our 
observations & directions for future research and 
improvements. This paper may be a good starting point for 
those who want to pursue research in the fault management 
area of wireless sensor networks. 

Index Terms - fault diagnosis, fault recovery, fault tolerant, 
fault management 


I. Introduction 

Wireless sensor networks are mainly designed and deployed 
for event monitoring and surveillance applications. Typical 
applications are alarm indication in disaster areas, such as 
tsunami, earthquake and forest fire detection [4] [12]. In 
most of these applications the network needs to be deployed 
remotely. After deployment, manually monitoring such a 
network is practically impossible because of its remote 
location. Algorithms working in a centralized or distributed 
manner [9] [11] should therefore take care of detecting 
faulty nodes. Once detected, these faulty nodes should not 
affect the accuracy of the final result to a great extent, as 
that may defeat the purpose of the deployed network itself. 
Ideally, a fault management algorithm should take care of 
fault identification, removal of and recovery from the fault, 
& avoidance of the same fault in future. 


Basically, fault management [9] [11] includes identifying 
faulty nodes, studying the effect of faulty nodes on the 
expected output, identifying the cause of the fault, and 
taking corrective actions to recover the faulty node. There 
are two major approaches: centralized and distributed. In the 
distributed approach [1] [2] [3] [6] [7] [8] [13] [14] [15] [16] 
[17] a node can independently (without repeatedly consulting 
a central authority) self-monitor or self-detect faults. In the 
centralized approach [5] [6] a central node frequently 
monitors node status by injecting requests into the wireless 
sensor network. Based on the available database, the 
administrator may identify any unexpected observations. 
Detailed analysis may then indicate failed or suspicious nodes. 

[Figure: network diagram. Legend: Healthy Node, Gateway Node, Sink, Faulty Node, Healthy Link, Faulty Link] 

Fig. 1 Faulty link, Faulty node in Wireless Sensor Network 

The contribution of this paper is an extensive survey that 
summarizes the state of the art in the area of fault 
management. We also analyze the strengths of existing fault 
management techniques and their weaknesses. Moreover, we 
identify further research scope in the fault management area. 

The Annual Report 2012-13 [4] of the Indian National Center 
for Ocean Information Services & Earth System Science 
Organization, under the Ministry of Earth Science, 
Government of India, in association with the National 
Disaster Management Authority, contributes a lot to 
motivating research in the disaster management area. 


II. RELATED WORK 

Many efforts have been made by various researchers in 
different domains such as neural networks [7][8][15][16][17][18], 
communication [1][14], and node monitoring & data 
aggregation techniques [5][6][10][13]. 

We shall categorize the fault management research papers 
based on domain. 

A. Based on Neural Network domain 

B. Based on Data Aggregation Techniques 

C. Based on Communication domain 

D. Based on approach of providing end to end solution 




www.erpublication.org 




A Survey of Fault Detection and Management Techniques in Wireless Sensor Networks 


A. Analysis of papers based on Neural Network 
domain: 

Fuzzy logic data fusion [7], the Hierarchical Bayesian 
Space-Time framework [8], tracing techniques [16], decision 
fusion [17] and the Recursive Principal Component Analysis 
tool [18] are the various Artificial Neural Network techniques 
used by the following research papers. We shall discuss the 
strengths and weaknesses of each paper in detail. 

[7] [2010] Jethro Shell et al apply a fuzzy logic data fusion 
approach to fault detection within a wireless sensor network. 
The framework proposed by the authors combines a control 
chart, one of the prominent monitoring tools in Statistical 
Process Control, with Clustered Covariance Mean Fusion and 
a Kalman filter that finds the optimum averaging factor for 
each consecutive state in a dynamic, noisy wireless sensor 
network. 

The framework observes subgroups, cluster heads, and lower 
& upper threshold limits, and finds the covariance in the 
network. This helps to reduce uncertainty & false positives 
within the fault detection process. The authors verified all 
results in a simulated environment; implementing the same on 
a real deployed sensor network would give a more realistic 
verification. 
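
As an illustration only, the combination of control limits and Kalman smoothing described above can be sketched in Python. The sketch is not the implementation of [7]: the constant-state model, the noise variances, the choice of k = 3 standard deviations and all function names are assumptions made here for the example.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Lower and upper control limits: mean +/- k standard deviations
    of fault-free baseline readings (k = 3 is an assumed default)."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def kalman_smooth(readings, process_var=1e-3, measurement_var=0.5):
    """Scalar Kalman filter with a constant-state model (an assumption).
    The gain plays the role of the averaging factor for each step."""
    estimate, error = readings[0], 1.0
    smoothed = [estimate]
    for z in readings[1:]:
        error += process_var                      # predict
        gain = error / (error + measurement_var)  # averaging factor
        estimate += gain * (z - estimate)         # update with new reading
        error *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


def flag_faults(readings, baseline, k=3.0):
    """Flag every smoothed reading that leaves the control limits."""
    low, high = control_limits(baseline, k)
    return [not (low <= x <= high) for x in kalman_smooth(readings)]
```

Smoothing before thresholding suppresses isolated noisy samples, which is one way to reduce false positives; a persistent offset still drives the estimate outside the limits.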

Hierarchical Bayesian Space-Time theory can be used to 
predict the likelihood of something happening in uncertain 
situations. Kevin et al [8] suggested a two-phase modular 
fault detection framework based on Hierarchical Bayesian 
Space-Time (HBST) modeling. 

Blind modeling is the first module in the fault detection 
pipeline. Data from all sensors within the audit time window 
is modeled assuming that all nodes are healthy. The output of 
blind modeling is given to trusted sensor selection, where an 
accurate model of the expected behavior is found. The second 
phase is the re-evaluation module, which takes the sensors 
that are marked as trusted; its output is then used for the 
fault decision. A large setup effort is required to establish 
the HBST model before deployment, and a lot of computational 
power is required to estimate the parameters of the model. 

A similar distributed, practical framework based on the 
Bayesian approach is used for specific chemical product 
stores by Sourour et al [15]. It is used to identify failed nodes 
with a certain level of performance and fault tolerance. It 
rests on two principal concepts: first, a sensory threshold 
(relative to the sensor) or "likelihood ratio threshold", and 
second, a decision rule that minimizes the probability of 
detection error. 
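
A likelihood ratio threshold of this kind can be illustrated with a small Python sketch. It is a textbook minimum-error test and not the procedure of [15]: the Gaussian measurement model with equal variance under both hypotheses, the prior `prior_event` and all function names are assumptions introduced here.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_ratio(x, mu0, mu1, sigma):
    """p(x | event) / p(x | no event) under the assumed Gaussian model."""
    return gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu0, sigma)


def decide_event(x, mu0, mu1, sigma, prior_event=0.5):
    """Minimum probability-of-error rule: declare an event when the
    likelihood ratio exceeds P(no event) / P(event)."""
    threshold = (1.0 - prior_event) / prior_event
    return likelihood_ratio(x, mu0, mu1, sigma) > threshold
```

With equal priors the threshold is 1, so a reading is assigned to whichever hypothesis explains it better; a rarer event raises the threshold and demands stronger evidence.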

Tracing techniques have been used for structural health 
monitoring. Vinaitheerthan et al [16] suggest a novel, 
efficient tracing technique that encodes and records the 
inter-procedural control flow of all interleaving concurrent 
events. The traces, which serve a replay debugging tool, 
include the ordering sequence of events, the control flow path 
taken, input values etc. The authors verified runtime overhead, 
energy consumption etc. on a test bed as well as on actual 
Golden Gate Bridge monitoring. 

The major limitation is the energy overhead due to tracing. 
The authors' approach is suitable for event-driven operating 
systems such as TinyOS. 

Considering the same application of structural health 
monitoring, another artificial neural network concept, 
decision making, can be used. Decision making is 
recognizing and choosing alternatives based on the values 
and preferences of the decision maker. Decision fusion 
assumes that if a sensor node gives a different decision about 
the occurrence of an event than the other nodes, it is faulty. 
Under value decision, if the expected output is n bits & the 
actual output deviates from those n bits, the node is declared 
faulty. X. Liu et al [17] focus on the “faulty sensor reading” 
type of fault, which occurs commonly but is difficult to detect. 
Structural characteristics, namely natural frequencies and 
mode shapes, are used for decision making. 

However, in structural health monitoring the expected 
output may be in analog form, so value decision may not be 
applicable. The scope of the model is limited to structural 
health monitoring. 
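
The two rules can be sketched as follows. This is one reading of the description above, not the algorithm of [17]: the majority rule, the tie behavior, the `max_bit_errors` parameter and the function names are assumptions for illustration.

```python
from collections import Counter


def suspect_by_decision_fusion(decisions):
    """decisions maps node id -> bool (did the node report the event?).
    Nodes that disagree with the majority decision are suspected.
    On a tie, Counter keeps the first decision seen as the 'majority'."""
    majority, _ = Counter(decisions.values()).most_common(1)[0]
    return sorted(node for node, d in decisions.items() if d != majority)


def suspect_by_value(expected_bits, actual_bits, max_bit_errors=0):
    """Value decision: compare an n-bit output with the expected n bits
    and declare a fault when more than max_bit_errors bits differ.
    Both sequences are assumed to have the same length."""
    errors = sum(e != a for e, a in zip(expected_bits, actual_bits))
    return errors > max_bit_errors
```

For example, with decisions `{"a": True, "b": True, "c": False, "d": True}` only node `c` is suspected. The bit comparison also makes the limitation noted above concrete: it has no direct counterpart for analog outputs.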

Another neural network concept is the Recursive Principal 
Component Analysis (RPCA) tool. Principal component 
analysis is a statistical procedure that uses an orthogonal 
transformation to convert a set of observations of possibly 
correlated variables into a few principal components. The 
Recursive Principal Component Analysis tool proposed by Xie 
et al [18] stores & updates the standard normal behavior of 
nodes. The algorithm compares the behavior of the network at 
a given instant with the stored one. If the detected result is 
normal, the stored behavior data is updated with the recent 
one. If the detected result is not normal, a fault alarm is 
raised. 

The authors evaluate the performance of RPCA using a dataset 
from a real-world deployment in the Intel Berkeley Research 
Lab, but have not implemented the algorithm on an actual 
test bed. 
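
The store, compare and update cycle can be sketched in pure Python. This is a simplified illustration and not the RPCA tool of [18]: it keeps only the first principal component, uses power iteration to find it, updates the model with an exponential forgetting factor, and takes a fixed residual threshold. All of these choices and all names are assumptions made for the example.

```python
def mean_vector(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]


def covariance(rows, mean):
    d = len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        c = [value - m for value, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (len(rows) - 1)
    return cov


def first_component(cov, iterations=200):
    """Dominant eigenvector of cov by power iteration. The start vector
    is arbitrary and assumed not orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


class RecursivePCADetector:
    def __init__(self, baseline, threshold, forgetting=0.95):
        self.mean = mean_vector(baseline)          # stored normal behavior
        self.cov = covariance(baseline, self.mean)
        self.threshold = threshold
        self.forgetting = forgetting

    def residual(self, sample):
        """Squared distance of the sample from the first principal axis."""
        c = [x - m for x, m in zip(sample, self.mean)]
        axis = first_component(self.cov)
        score = sum(a * b for a, b in zip(axis, c))
        return sum((ci - score * ai) ** 2 for ci, ai in zip(c, axis))

    def observe(self, sample):
        """Return True (fault alarm) if the sample is abnormal; otherwise
        fold it into the stored model and return False."""
        if self.residual(sample) > self.threshold:
            return True
        f = self.forgetting
        c = [x - m for x, m in zip(sample, self.mean)]
        d = len(c)
        self.mean = [f * m + (1 - f) * x for m, x in zip(self.mean, sample)]
        self.cov = [[f * self.cov[i][j] + (1 - f) * c[i] * c[j]
                     for j in range(d)] for i in range(d)]
        return False
```

Each sample here is a tuple of readings from correlated sensors taken at the same instant. Abnormal samples are deliberately not folded into the model, so a fault cannot gradually redefine what counts as normal.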

B. Analysis of papers based on Data Aggregation 
Techniques: 

Data aggregation algorithms in a wireless sensor network 
help to collect and aggregate data so that less energy is 
consumed and network lifetime is extended. Fatima et al 
[5] proposed an adaptive Neighborhood Failure Detection 
(NFD) mechanism. The NFD mechanism has adaptive 
timers to detect node crashes due to failure of radio links. 
The sink sends a Query packet and waits for a response until 
the Failure Detection Timeout (FDT) expires. Each node keeps 
a counter, which is incremented at every round. The two 
possible node statuses are “Suspected”, meaning suspected of 
being faulty, and “Mistake”, meaning correction of an earlier 
false suspicion. The mechanism adds fault detection logic to 
the Directed Diffusion routing protocol. Frequently 
broadcasting the Query packet & its responses may increase 
network traffic, reducing network lifetime. 
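
The interplay of the timeout and the two statuses can be sketched as below. The sketch is not the NFD mechanism of [5]: the rule that a mistake lengthens the timeout by a fixed `backoff`, the extra "Alive" status and the class and method names are assumptions chosen for illustration.

```python
SUSPECTED, MISTAKE, ALIVE = "Suspected", "Mistake", "Alive"


class AdaptiveFailureDetector:
    def __init__(self, initial_timeout=1.0, backoff=0.5):
        self.initial_timeout = initial_timeout
        self.backoff = backoff      # assumed increment after a mistake
        self.timeout = {}           # per-node failure detection timeout
        self.status = {}

    def poll(self, node, response_delay):
        """Record one query round. response_delay is the time until the
        node answered, or None if no answer arrived at all."""
        timeout = self.timeout.setdefault(node, self.initial_timeout)
        if response_delay is None:
            self.status[node] = SUSPECTED
        elif response_delay > timeout or self.status.get(node) == SUSPECTED:
            # A late or renewed answer shows the suspicion was wrong:
            # correct the status and adapt (lengthen) the timer.
            self.status[node] = MISTAKE
            self.timeout[node] = timeout + self.backoff
        else:
            self.status[node] = ALIVE
        return self.status[node]
```

Lengthening the timer after a mistake trades slower detection for fewer false suspicions on lossy radio links.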

In data aggregation techniques the deployment of nodes also 
carries a lot of significance. If nodes are deployed with large 
distances in between, many small networks may form without 
connectivity between them [6]. If nodes are densely deployed, 
they may produce many redundant results, resulting in higher 
cost. So it is quite challenging to optimize the number of 
nodes per cluster (node density) [6]. 

Habib et al [6] proposed a distributed protocol for 
heterogeneous, static as well as dynamic nodes to k-cover a 
Region of Interest (RoI). The algorithm reduces the energy 
consumption due to communication and mobility of sensors. 
The static sink monitors all mobile proxy sinks by sending 
Query packets; the mobile proxy sinks compare with their own 
density of nodes & based on the results decide to merge or 
lead. 

The strength of this paper is that the authors try to give an 
end-to-end solution, from architecture and on-demand 
k-coverage protocol to data-gathering algorithm. As a first 
step, the authors suggest a four-tier architecture containing 
one static sink, which is the central coordinator, mobile proxy 
sinks, mobile sensors, and static sensors. In the next step, the 
authors propose an on-demand k-coverage protocol that 
exploits sensor mobility to achieve k-coverage of any region 
of interest. On top of k-covered network configurations, the 
authors propose two data-gathering protocols that use mobile 
proxy sinks to deliver the collected sensed data to the static 
sink. 
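
To make the notion of k-coverage concrete, a minimal Python check is sketched below. It only tests the coverage property on sample points and is not the on-demand protocol of [6]; the disc sensing model, the `(x, y, sensing_range)` tuple layout and the function names are assumptions.

```python
import math


def coverage_count(point, sensors):
    """Number of sensors whose sensing disc contains the point.
    Each sensor is (x, y, sensing_range); ranges may differ because
    the nodes are heterogeneous."""
    return sum(math.dist(point, (sx, sy)) <= rng for sx, sy, rng in sensors)


def is_k_covered(points, sensors, k):
    """True if every sample point of the region is seen by >= k sensors."""
    return all(coverage_count(p, sensors) >= k for p in points)
```

A region that is k-covered with k > 1 keeps being observed even if up to k - 1 of the sensors watching a point fail, which is why coverage matters for fault tolerance.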

The limitation of the paper is that the authors consider only 
one event at a time & only one static sink, which is almost 
impossible in practice. In the direct data gathering method, 
the distance to a leader mobile proxy sink may not be the 
shortest distance, so end-to-end delay may increase. In the 
chain-based data gathering approach, failure of the leader 
mobile proxy sink may cause data loss, as there is no 
redundant way to connect with the static sink. 

Peng Yu et al [13] proposed Node Self Detection by History 
data and Neighbors (NDHN), which collects the characteristics 
of the nodes. The historical data is used to compute & take 
the decision. The authors evaluated it in simulation; 
implementing it in a real-world scenario will be challenging. 

Both Md Zakirul et al [10][2013] & Peng Yu et al [13][2011] 
insist on the node monitoring concept. [13] suggests that every 
node gets its neighboring nodes' measurements, concatenates 
and analyzes them & forwards the faulty/healthy decision to 
the sink. [10] suggests that each individual node self-monitors, 
monitors its associated links & sends a status report to the 
sink through a coordinator node. 
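
The idea of combining a node's own history with its neighbors' measurements can be sketched as follows. The rule used here (a reading is suspicious only if it deviates from both references by more than a tolerance) is an assumption made for illustration and is not taken from [13]; the function name and parameters are hypothetical.

```python
from statistics import mean, median


def self_detect(current, history, neighbor_readings, tolerance):
    """Return True if the node should consider its own reading faulty.
    Deviation from history alone may be a real event, so the reading is
    flagged only when the neighbors do not confirm it either."""
    deviates_from_history = abs(current - mean(history)) > tolerance
    deviates_from_neighbors = abs(current - median(neighbor_readings)) > tolerance
    return deviates_from_history and deviates_from_neighbors
```

For instance, a jump to 40.0 after a history around 20.0 is flagged when the neighbors still report about 20.0, but not when they also report about 40.0.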

C. Analysis of papers based on Communication 
Domain: 

In the communication domain we consider papers in which 
some additional tags are added to the existing frame format 
for fault detection and management. 

Abu et al [1] suggest adding In-network Packet Tagging (IPT) 
in every node. Every node adds its own path checksum tag 
to each data packet going to the sink node. While the packet 
traverses the path to the sink, each node on the path 
updates the tag with its own node ID by means of the Fletcher 
checksum algorithm. Once the packet arrives at the sink, the 
path checksum is cross-verified against the stored path 
checksum in the Network Data Base. Initially the sink 
performs Network Path Analysis (NPA) & stores all network 
paths & their respective path checksums in the Network Data 
Base (NDB). If the arrived checksum differs from the one in 
the NDB, the Fault Detection and Identification module (FDI) 
sends a control message to the affected node. 
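
The tagging and verification steps can be sketched with the standard Fletcher-16 checksum. The sketch is illustrative and not the packet format of [1]: one-byte node IDs, the dictionary used as the NDB and the function names are assumptions.

```python
def fletcher16_update(sum1, sum2, byte):
    """Fold one byte into a running Fletcher-16 state."""
    sum1 = (sum1 + byte) % 255
    sum2 = (sum2 + sum1) % 255
    return sum1, sum2


def path_checksum(node_ids):
    """Tag produced when each node on the path, in order, folds its own
    ID (assumed to fit in one byte) into the packet's checksum."""
    sum1 = sum2 = 0
    for node_id in node_ids:
        sum1, sum2 = fletcher16_update(sum1, sum2, node_id)
    return (sum2 << 8) | sum1


def verify_at_sink(ndb, source, tag):
    """Compare the arrived tag with the checksum stored for the source's
    expected path; a mismatch indicates a changed or broken path."""
    return ndb.get(source) == tag
```

Because the second running sum depends on position, the tag changes when the same nodes are visited in a different order, not only when a node is missing. For example, the paths [1, 2, 3] and [1, 3, 2] yield the tags 2566 and 2822 respectively.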

The above Sequence Based Failure Detection framework 
detects faults in networks with periodic data transmission to 
the sink. The authors injected failures in a real test bed 
network and calculated accuracy as the ratio of the number of 
faults detected by the algorithm to the number of failures 
actually injected. The authors claim that Sequence Based 
Failure Detection is lightweight, accurate, and scalable. 
However, the sink has to store a huge path checksum 
database, and the computation & verification of path 
checksums and the initialization & execution of all failure 
detection & identification procedures add considerable 
overhead. 

Analogous to election algorithms in distributed operating 
systems, the concept of voting is used by Shahram et al [14]. 
The network is divided into two groups of clusters, namely 
the downstream group and the upper group. Each group 
consists of several clusters, and each cluster has one voter 
node. In each cluster the voter node performs a voting 
operation on the sensed data. The voter nodes in the upper 
group forward the data received from the downstream group. 

Many points are not clear, such as how clusters are formed, 
who decides the voters, and what happens to the data that a 
voter node forwards. The authors claim that their algorithm 
can detect and recover from faults in a decentralized way. 

D. Analysis of papers based on providing end to end 
solution: 

A few authors have proposed end-to-end solutions, i.e. from 
architecture to application. The algorithms included in these 
papers provide solutions for all probable problems in the 
network. Dima Hamdan [2] proposed an integrated fault 
tolerance framework (IFTF). The framework first diagnoses 
network faults which are likely to happen in WSN 
deployments; second, quickly assesses the impact of faults on 
the whole system behavior; third, improves the fault detection 
rate by detecting some hidden causes of faults (silent or 
predefined faults); and fourth, validates the application after 
code upgrades or any changes in the operating conditions. 
The IFTF Manager initiates two services: the Application 
Testing Service & the Network Diagnosis Service. The 
Application Testing Service is like black box testing: services 
are tested by feeding the nodes with test inputs and comparing 
the outputs with the expected ones. The Network Diagnosis 
Service consists of a location detection phase & a consensus 
phase, in which the monitoring node shares its findings with 
its neighbors. The algorithm proposed by the author is 
restricted to mining operations. 

| Ref paper | Centralized/distributed | Mobility of nodes | Mobility of sink | Link quality | Congestion level | Fault detection/correction/tolerance |
|---|---|---|---|---|---|---|
| [1] [2014] Abu et al | distributed | Static only | Static only | yes | yes | Fault detection |
| [2] [2012] Dima et al | distributed | Not mentioned | Not mentioned | yes | no | Fault tolerant |
| [3] [2012] Dima et al | distributed | Not mentioned | Not mentioned | yes | no | Fault detection |
| [5] [2011] Fatima et al | centralized | Not mentioned | Not mentioned | yes | no | Fault detection |
| [6] [2013] Habib et al | both | Mobile & static | static | no | no | Fault tolerant |
| [7] [2010] Jethro Shell et al | distributed | Not mentioned | Not mentioned | Not mentioned | Not mentioned | Fault detection |
| [8] [2012] Kevin et al | distributed | Not mentioned | Not mentioned | Not mentioned | Not mentioned | Fault detection |
| [10] [2013] Md Zakirul et al | Semi centralized/semi distributed | Mobile | Mobile | yes | no | Fault detection |
| [13] [2011] Peng Yu et al | distributed | Not mentioned | Not mentioned | no | no | Fault detection |
| [14] [2010] Shahram et al | distributed | static | static | no | no | Claims fault detection & recovery |
| [15] [2013] Sourour et al | distributed | Not mentioned | Not mentioned | no | no | Fault detection |
| [16] [2013] Vinaitheerthan et al | distributed | Not mentioned | Not mentioned | yes | yes | Fault detection |
| [17] [2011] X. Liu et al | distributed | Not applicable | Not applicable | no | no | Fault tolerant model for faulty sensor reading |
| [18] [2012] Xie Yingxin et al | Not applicable | Not applicable | Not applicable | no | no | Fault detection |

Dima et al [3] proposed a layer-independent, adaptive and 
efficient approach for fault diagnosis in WSN. The “SMART” 
layer works in between the operating system & the application 
and is divided into 4 major components. First, it executes a 
failure detection phase consisting of a location detection 
phase & a consensus phase, in which the monitoring node 
shares its findings with its neighbors. Second, it executes duty 
cycle management. Third is neighbor management, where 
each node maintains updates about its neighboring nodes. 
This helps to keep track of network topology dynamics, node 
redeployment, node failures, node reconnection and so on. 

A node is considered a neighbor based on a certain threshold 
of link quality. Fourth is the message protocol: a node sends a 
Heartbeat message (H) to check whether a neighbor node is 
alive or not, a Query message (Q) to inquire about the status 
of a suspicious node, a Reject message (R) to indicate that a 
suspicious node is still alive, and an Acknowledge message 
(A) to acknowledge the reception of a Reject message about a 
suspicious node. The authors have included almost all the 
logical algorithms needed for a complete fault management 
system. The authors do not mention the overheads in terms 
of network traffic & how they affect network lifetime. 
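
The exchange of the four message types can be sketched as follows. The survey does not give the exact handshake of [3], so the sequence below (query every witness after a missed heartbeat, treat a single Reject as proof of life) is an assumption, as are the function name, the "monitor" label and the log format.

```python
def diagnose(suspect, heartbeat_answered, witnesses):
    """Simulate one diagnosis round run by a monitoring node.

    heartbeat_answered: whether the suspect replied to the Heartbeat (H).
    witnesses: maps neighbor id -> bool, True if that neighbor has
               recently heard from the suspect.
    Returns (verdict, log) where log lists (type, sender, receiver).
    """
    log = [("H", "monitor", suspect)]
    if heartbeat_answered:
        return "alive", log
    rejected = False
    for neighbor, heard_suspect in witnesses.items():
        log.append(("Q", "monitor", neighbor))        # ask about the suspect
        if heard_suspect:
            log.append(("R", neighbor, "monitor"))    # suspicion rejected
            log.append(("A", "monitor", neighbor))    # acknowledge the Reject
            rejected = True
    return ("alive" if rejected else "failed"), log
```

The log also makes the traffic cost visible: each suspicion costs one Query per neighbor plus two more messages per rejecting neighbor, which is the kind of overhead the authors leave unquantified.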

Similarly, Md Zakirul et al [12] suggest an embedded 
algorithm with which each node self-monitors to check for 
faults in its own behavior & transmits the result to the 
coordinators. Link monitoring is done for 1-hop neighbors, 
and a behavior report is transmitted to the coordinators. 
Finally these reports are sent to the sink & the sink in return 
may take any action for fault tolerance. The report about link 
status is prepared using a CMC (continuous time Markov 
chain), link failure probability etc. For a node observing its 
own faults in active mode, the authors define 3 process states: 
Preprocessing (P) is the initial state that waits or prepares for 
new tasks; Working (W) is the state that mainly processes the 
tasks; Idle (I) is the idle period of time during processing of a 
task. The algorithm is embedded in the node itself, with no 
involvement of the sink. Each node individually monitors the 
links to its 1-hop neighbors. Finally all information is 
gathered at the sink & the remote monitoring center. The 
work is a simulation in OMNeT++ only; the results were 
verified as a proof of concept, assuming a remote control car 
as the mobile event. 
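
How a continuous time Markov chain yields a link failure probability can be shown with the simplest possible model. The two-state (up/down) chain, the constant failure and repair rates and the function name are assumptions made here; the model actually used in [12] is not described in this survey.

```python
import math


def link_down_probability(failure_rate, repair_rate, t):
    """Probability that a link which is up at time 0 is down at time t,
    for a two-state continuous time Markov chain with constant
    failure rate (up -> down) and repair rate (down -> up)."""
    total = failure_rate + repair_rate
    return failure_rate / total * (1.0 - math.exp(-total * t))
```

The value starts at 0 and approaches the steady state failure_rate / (failure_rate + repair_rate); a node could report a link as suspicious once this probability, computed from estimated rates, crosses a chosen limit.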

Many efforts have been made by various researchers in 
different domains such as communication [15], neural 
networks [7][8][16][17][18], and node monitoring & data 
aggregation techniques [5][6][10][13]. A few authors have 
opted for a special fault management layer in between the 
WSN operating system & the application [2] [3]. Adding an 
additional checksum field [1] to the network packet & 
verifying the decoded tag information at the sink may also 
help in fault management. The majority of papers use Neural 
Network techniques. Both [15] [2013] & [17] [2011] use the 
concepts of value fusion & decision fusion from the parallel 
decentralized detection scheme; the only difference is that 
[15] uses a Bayesian approach whereas [17] uses natural 
frequencies & mode shapes to detect faults. The algorithms in 
both papers are designed for a particular application: [15] for 
storage of chemical products, [17] for civil structural health 
monitoring. A fault tolerant system should ideally give the 
same output irrespective of faulty nodes. Proper coverage of 
the network area helps in this. [6][2013] Habib et al suggest 
two data gathering approaches in an architecture of mobile 
and static nodes, a static sink and mobile proxy sinks. The 
authors showed that the chain-based approach performs 
better than the direct data gathering approach. The decision 
to add nodes to the region of interest is based on local 
density, that is, the number of mobile sensors in the 
communication range. 

Future Research Directions: 

The fault management domain offers immense scope for 
contribution. It consists of identifying abnormal behavior of 
the network, finding the reason for the unexpected output, 
isolating the element responsible for it, applying recovery 
techniques to help the network resume normal behavior, and 
taking precautionary measures to reduce the frequency of the 
same fault occurring again. 

From a methodological point of view, fault management is 
composed of steps like analysis of the system, design of 
algorithms and testing of algorithms. From a technological 
point of view, the components involved are nodes, links, and 
embedded algorithms. Apart from these there are 
environmental factors. 

A fault management approach may be proactive or reactive. 
In the proactive approach, after initial deployment we may 
inject some faults to test the fault detection algorithms. 
Executing these algorithms cyclically may help to identify 
faults before they have any major disastrous effects. Fault 
prediction is the most important step in the proactive 
approach. 

In the reactive approach, we come to know about the 
existence of a fault from abnormal behavior. Fault detection 
algorithms then help to identify the root cause of the fault & 
take corrective actions. We believe that fault management is 
a part of project design in the software engineering life 
cycle. There is a lot of scope to work in the fault prediction 
area. The scope of fault prediction may be extended to 
understanding the post-occurrence effect of a fault on the 
network. 

Our future research plan is to work on the research gaps 
mentioned in this survey paper, then to design solutions for 
these gaps & plan the scope of effort required. Without 
disturbing the basic objective of network deployment, we aim 
to provide various solutions for the betterment of that 
objective. 

We shall consider the advantages of both proactive and 
reactive approaches, and of centralized as well as distributed 
approaches. As shown in Figure 1, our first phase will follow 
the centralized approach for deployment and configuration, 
whereas for the second phase, implementation and execution, 
we shall opt for the distributed approach. 

III. Conclusion 

In the centralized approach, heavy traffic at the sink can 
lead to a communication hole problem, and, as far as we have 
read, no one has yet discussed how the system remains stable 
if the sink itself fails. However, coordination of the whole 
failure management system is easier than in the distributed 
approach. We are planning a hybrid approach that combines 
the best of both approaches. 

Most failures occur due to remote deployment, which leads 
to connectivity issues, so we propose to start implementing 
failure management with deployment & connectivity issues, 
i.e. links among the nodes and coverage in the network. The 
next most likely causes of failure are hardware, software and 
battery failures. 

Our intuition suggests that if communication is limited to a 
minimum number of hops, power consumption will be lower & 
hence the possibility of failure due to battery depletion can 
be reduced. So most of our transactions should be single hop. 